Sales Analysis and forcasting for Sneakers market


Hello everyone and welcome again.

As part of a personal process, I decided to analyze and decipher several databases on subjects that I'm interrested in, and to share with you the results I obtained.

Today we're gonna analyse a market that I've been personnally investing in : the Sneakers market. This market has grown so much over the past decade. Statista's recently published a report on the Sneakers market indicating that the "global Sneakers" reached $127.3 billion of market capitalisation by 2021.

The current U.S. sneaker resale market is estimated at 2 billion dollars, but is expected to explode 15 times to 30 billion dollars by 2030. One of the main reasons: value appreciation. Sneakers are increasingly seen as an alternative asset.

The expansion of this market has witnessed the groth of new actors such as GOAT or StockX.

Earlier in 2021, resale marketplace app GOAT raised 195 million dollars in venture capital funding for a 3.7 billion dollars valuation. The company claimed 2 billion dollars in sales on the platform from mid-2020 to July 2021, with sneaker sales doubling.

Like GOAT, resale platform StockX was able to raise capital at a $3.8 billion valuation with IPO prospects later this year.

Today, we're going to anlyse a StockX Dataset and try to make predictions on this dataset. Here are the questions we're gonna try to answer :

  • What shoes are most popular?
  • Which shoes have the best/worst profit margins?
  • What factors affect profit margin?
  • Is it possible to predict the sale price of a shoe at a given time?

First, let's try import the packages we're gonna need :

Imports

In [ ]:
!pip install --upgrade matplotlib
In [ ]:
!pip install colour
!pip install dython
import warnings
warnings.filterwarnings('ignore')
In [9]:
from colour import Color
from collections import OrderedDict
import pandas as pd
import numpy as np  # linear algebra
import itertools  # iteration utils
from scipy.interpolate import griddata  # for 3d surface plot
import pickle
import plotly.express as px
import plotly.graph_objects as go
from plotly import subplots
import plotly.io as pio

Data Loading

In [10]:
df = pd.read_csv("StockX-Data-Contest-2019-3.csv")
In [11]:
df.head()
Out[11]:
Order Date Brand Sneaker Name Sale Price Retail Price Release Date Shoe Size Buyer Region
0 9/1/17 Yeezy Adidas-Yeezy-Boost-350-Low-V2-Beluga $1,097 $220 9/24/16 11.0 California
1 9/1/17 Yeezy Adidas-Yeezy-Boost-350-V2-Core-Black-Copper $685 $220 11/23/16 11.0 California
2 9/1/17 Yeezy Adidas-Yeezy-Boost-350-V2-Core-Black-Green $690 $220 11/23/16 11.0 California
3 9/1/17 Yeezy Adidas-Yeezy-Boost-350-V2-Core-Black-Red $1,075 $220 11/23/16 11.5 Kentucky
4 9/1/17 Yeezy Adidas-Yeezy-Boost-350-V2-Core-Black-Red-2017 $828 $220 2/11/17 11.0 Rhode Island

Features ingeneering

Currently the dataset consists of the single file of sales provided by StockX. ~10000 shoe sales from 50 different models, yeezys and Nike X Off_white.

The name of the features is pretty obvious and doesn't need explaination. We could be The first thing we are going to do is to add new features that we're gonna use later in our study:

The first feature that can be interresting to have is the color of the shoe. The color of a product is indeed often a criteria for customer, accordingly to their preferences. The color is indicated in the "Sneaker Name" feature, but not clearly identified as an independant feature.

In [12]:
def check_color(color):
    try:
        Color(color)
        return True
    except ValueError:
        return False

colors = []

for i in df["Sneaker Name"]:
  color = ""
  words = i.split("-") 
  for word in words:
    if check_color(word):
      color+=(" "+word)
  colors.append(color)

df["Colors"] = colors
In [13]:
df.head()
Out[13]:
Order Date Brand Sneaker Name Sale Price Retail Price Release Date Shoe Size Buyer Region Colors
0 9/1/17 Yeezy Adidas-Yeezy-Boost-350-Low-V2-Beluga $1,097 $220 9/24/16 11.0 California
1 9/1/17 Yeezy Adidas-Yeezy-Boost-350-V2-Core-Black-Copper $685 $220 11/23/16 11.0 California Black
2 9/1/17 Yeezy Adidas-Yeezy-Boost-350-V2-Core-Black-Green $690 $220 11/23/16 11.0 California Black Green
3 9/1/17 Yeezy Adidas-Yeezy-Boost-350-V2-Core-Black-Red $1,075 $220 11/23/16 11.5 Kentucky Black Red
4 9/1/17 Yeezy Adidas-Yeezy-Boost-350-V2-Core-Black-Red-2017 $828 $220 2/11/17 11.0 Rhode Island Black Red

This obviously works when the color is in the sneakers name, but not for all of them, let's check how many blanks we have :

In [14]:
len(df.loc[df["Colors"]==""])
Out[14]:
46314

Wih this method, we have 46314 blanks. That's half of our dataset. To resolve this problem, we're gonna scrap the colors directly from the StockX website. For this, we need a list of all different sneakers in our dataset. Here is our list :

In [15]:
df.pop('Colors')
df["Sneaker Name"] = df["Sneaker Name"].str.replace("-"," ")
allnames = df["Sneaker Name"].unique()

I won't detail here how I did to scrap the data on StockX, but I published a Medium article that you can access here. You can also get the whole code on my github by clicking this. What I'll do right now will be to load the dictionnary that contains all the sneakers colors and get the main colors out of it.

In [17]:
with open('saved_dictionary.pkl', 'rb') as f:
    final_dict = pickle.load(f)
In [ ]:
for key in final_dict.keys():
  final_dict[key] = final_dict[key].replace("/"," ")
  final_dict[key] = final_dict[key].replace("-"," ")
print(final_dict)
In [19]:
for i in final_dict.keys():
  color = ""
  words = final_dict[i].split(" ") 
  for word in words:
    if check_color(word):
      color+=(" "+word)
  color = ' '.join(OrderedDict.fromkeys(color.split())).split(" ")[0]
  final_dict[i] = color
In [20]:
print(final_dict)
{'Adidas Yeezy Boost 350 Low V2 Beluga': 'GREY', 'Adidas Yeezy Boost 350 V2 Core Black Copper': 'BLACK', 'Adidas Yeezy Boost 350 V2 Core Black Green': 'BLACK', 'Adidas Yeezy Boost 350 V2 Core Black Red': 'BLACK', 'Adidas Yeezy Boost 350 V2 Core Black Red 2017': 'BLACK', 'Adidas Yeezy Boost 350 V2 Core Black White': 'BLACK', 'Adidas Yeezy Boost 350 V2 Cream White': 'WHITE', 'Adidas Yeezy Boost 350 V2 Zebra': 'WHITE', 'Adidas Yeezy Boost 350 Low Moonrock': 'GRAY', 'Nike Air Max 90 Off White': '', 'Nike Air Presto Off White': 'WHITE', 'Nike Air VaporMax Off White': 'BLACK', 'Air Jordan 1 Retro High Off White Chicago': 'WHITE', 'Nike Blazer Mid Off White': 'BLACK', 'Adidas Yeezy Boost 350 Low Pirate Black 2016': 'BLACK', 'Adidas Yeezy Boost 350 Low Oxford Tan': 'TAN', 'Adidas Yeezy Boost 350 Low Turtledove': 'BLUE', 'Adidas Yeezy Boost 350 Low Pirate Black 2015': 'BLACK', 'Adidas Yeezy Boost 350 V2 Semi Frozen Yellow': 'YELLOW', 'Nike Air Force 1 Low Off White': 'GREEN', 'Nike Air Max 97 Off White': 'WHITE', 'Nike Air Force 1 Low Virgil Abloh Off White AF100': 'WHITE', 'Nike React Hyperdunk 2017 Flyknit Off White': 'WHITE', 'Nike Zoom Fly Off White': 'PINK', 'Adidas Yeezy Boost 350 V2 Beluga 2pt0': 'GREY', 'Adidas Yeezy Boost 350 V2 Blue Tint': 'BLUE', 'Nike Air VaporMax Off White 2018': 'BLACK', 'Air Jordan 1 Retro High Off White White': 'WHITE', 'Nike Air VaporMax Off White Black': 'BLACK', 'Air Jordan 1 Retro High Off White University Blue': 'WHITE', 'Nike Air Presto Off White Black 2018': 'BLACK', 'Nike Air Presto Off White White 2018': 'BLACK', 'Nike Zoom Fly Mercurial Off White Black': 'BLACK', 'Nike Zoom Fly Mercurial Off White Total Orange': 'ORANGE', 'adidas Yeezy Boost 350 V2 Butter': '', 'Nike Air Max 97 Off White Elemental Rose Queen': 'BLACK', 'Nike Blazer Mid Off White All Hallows Eve': 'ORANGE', 'Nike Blazer Mid Off White Grim Reaper': 'BLACK', 'Adidas Yeezy Boost 350 V2 Sesame': '', 'Nike Blazer Mid Off White Wolf Grey': 'GREY', 'Nike Air Max 97 Off White Menta': 'WHITE', 'Nike Air Max 97 Off White Black': 'BLACK', 'Nike Zoom Fly Off White Black Silver': 'BLACK', 'Nike Zoom Fly Off White Pink': 'PINK', 'Nike Air Force 1 Low Off White Volt': 'BLACK', 'Nike Air Force 1 Low Off White Black White': 'BLACK', 'adidas Yeezy Boost 350 V2 Static': 'BLACK', 'adidas Yeezy Boost 350 V2 Static Reflective': 'BLACK', 'Nike Air Max 90 Off White Black': 'BLACK', 'Nike Air Max 90 Off White Desert Ore': ''}
In [21]:
colors_list = []
for i in df["Sneaker Name"]:
  colors_list.append(final_dict[i])
df["Colors"] = colors_list

We still have a few sneakers that dont have a corresponding color:

In [22]:
df["Sneaker Name"].loc[df["Colors"]==""].unique()
Out[22]:
array(['Nike Air Max 90 Off White', 'adidas Yeezy Boost 350 V2 Butter',
       'Adidas Yeezy Boost 350 V2 Sesame',
       'Nike Air Max 90 Off White Desert Ore'], dtype=object)

Sans titre (1).png

For these 4 models, we will just fill by hand the corresponding color.

In [23]:
final_dict['Nike Air Max 90 Off White'] = 'WHITE'
final_dict['adidas Yeezy Boost 350 V2 Butter'] = 'BEIGE'
final_dict['Adidas Yeezy Boost 350 V2 Sesame'] = 'GREY'
final_dict['Nike Air Max 90 Off White Desert Ore'] = 'TAN'
colors_list = []
for i in df["Sneaker Name"]:
  colors_list.append(final_dict[i])
df["Colors"] = colors_list

Now that we have our colors, let's create a new column that will represent the capital gain of our sneakers. This new feature will help us quantify how much value a certain sneaker gained through time.

In [24]:
df['Sale Price'] = df['Sale Price'].str.replace("$","")
df['Sale Price'] = df['Sale Price'].str.replace(",","")
df['Retail Price'] = df['Retail Price'].str.replace("$","")
df['Retail Price'] = df['Retail Price'].str.replace(",","")
df['Retail Price'] = df['Retail Price'].astype(float)
df['Sale Price'] = df['Sale Price'].astype(float)
In [25]:
df["Margin"] = df['Sale Price']-df["Retail Price"]
df["Value_increase(%)"] = (df['Sale Price']*100) /df["Retail Price"]

Another feature that could be interesting to add would be the nuber of days that passed between the order date and the realease date. This will indicate if it's better for a customer to buy a sneaker when it drops or later, and how the value fluctuate through time.

In [26]:
df['Order Date']= pd.to_datetime(df['Order Date'])
df['Release Date']= pd.to_datetime(df['Release Date'])
df["Days_between_OR"] = df['Order Date']-df['Release Date']
In [27]:
df
Out[27]:
Order Date Brand Sneaker Name Sale Price Retail Price Release Date Shoe Size Buyer Region Colors Margin Value_increase(%) Days_between_OR
0 2017-09-01 Yeezy Adidas Yeezy Boost 350 Low V2 Beluga 1097.0 220.0 2016-09-24 11.0 California GREY 877.0 498.636364 342 days
1 2017-09-01 Yeezy Adidas Yeezy Boost 350 V2 Core Black Copper 685.0 220.0 2016-11-23 11.0 California BLACK 465.0 311.363636 282 days
2 2017-09-01 Yeezy Adidas Yeezy Boost 350 V2 Core Black Green 690.0 220.0 2016-11-23 11.0 California BLACK 470.0 313.636364 282 days
3 2017-09-01 Yeezy Adidas Yeezy Boost 350 V2 Core Black Red 1075.0 220.0 2016-11-23 11.5 Kentucky BLACK 855.0 488.636364 282 days
4 2017-09-01 Yeezy Adidas Yeezy Boost 350 V2 Core Black Red 2017 828.0 220.0 2017-02-11 11.0 Rhode Island BLACK 608.0 376.363636 202 days
... ... ... ... ... ... ... ... ... ... ... ... ...
99951 2019-02-13 Yeezy adidas Yeezy Boost 350 V2 Static Reflective 565.0 220.0 2018-12-26 8.0 Oregon BLACK 345.0 256.818182 49 days
99952 2019-02-13 Yeezy adidas Yeezy Boost 350 V2 Static Reflective 598.0 220.0 2018-12-26 8.5 California BLACK 378.0 271.818182 49 days
99953 2019-02-13 Yeezy adidas Yeezy Boost 350 V2 Static Reflective 605.0 220.0 2018-12-26 5.5 New York BLACK 385.0 275.000000 49 days
99954 2019-02-13 Yeezy adidas Yeezy Boost 350 V2 Static Reflective 650.0 220.0 2018-12-26 11.0 California BLACK 430.0 295.454545 49 days
99955 2019-02-13 Yeezy adidas Yeezy Boost 350 V2 Static Reflective 640.0 220.0 2018-12-26 11.5 Texas BLACK 420.0 290.909091 49 days

99956 rows × 12 columns

Now let's have a look at our features, their repartition and how they're related to each other.

Data Visualisation

In [28]:
px.bar(df, x='Order Date', y='Sale Price', color='Brand', color_discrete_map={'Yeezy':'black','Off-White': 'green'}, title='Volume of sales by sneaker brand')

This graph shows us the volume of sales by brand. It can also be interpreted as an indicator of performance of each brand for the resale ability, or liquidity. The bigger the value, the better a pair of sneakers is sold. We can see that globally, the volume of Off White sneakers sold is way bigger than the volume of yeezy sneakers sold. We can also see 2 significant peaks on these graphs. They both happened between November and january. This is not surprising since it's corresponding to the christmas and new year period. Be careful tho, this is not a profitability indicator. Indeed, it's not because a pair of shoes was sold more that the margin is bigger. Let's evaluate how the profitability evolves through time:

In [29]:
px.bar(df, x='Order Date', y='Margin', color='Brand', color_discrete_map={'Yeezy':'black','Off-White': 'green'},title='Sum of profitability by sneaker brand')

Again, we can see that the margins are bigger in the end of the year. We can also see that the margin made on OFF WHITE shoes is bigger than the margin made on YEEZYs. Selling OFF WHITE in the end of the year seems to be an interresting investment choice.

In [30]:
table = df.groupby(['Buyer Region'])['Sneaker Name'].count().reset_index()
# importing geopy library
from geopy.geocoders import Nominatim
# calling the Nominatim tool
loc = Nominatim(user_agent="GetLoc")
# entering the location name
getLat = [loc.geocode(region).latitude for region in table["Buyer Region"]] 
getLong = [loc.geocode(region).longitude for region in table["Buyer Region"]] 
table["Lat"] = getLat
table["Long"] = getLong
In [31]:
fig = px.scatter_geo(
    table, lon = table['Long'], lat = table['Lat'],color="Buyer Region",
    size="Sneaker Name", hover_name="Buyer Region",scope = 'usa', color_continuous_scale=[(0, "white"), (1, "green")]
)
fig.show()

Let's now have a look on the color variable we created. To what extenr do we have different margins depending on the color of the sneaker?

In [32]:
margin_per_col = df.groupby("Colors").sum().reset_index()
fig = px.pie(
    margin_per_col, values='Margin', names='Colors',color="Colors",
    hole=0.5, color_discrete_map=dict(zip(df.Colors, df.Colors)))
fig.update_layout(height=500, title='Color of the sneakers sold')
fig.show()

It seems like the biggest margins are made with neutral tones such as Black, White or grey colors. We can also see that unusual colors such as green or pink seem to be less profitable.

As a reminder, the 2 questions we are trying to answer are the following :

  • What factors affect profit margin?
  • Is it possible to predict the sale price of a shoe at a given time?

Now that we have an idea on how our features behave. We're going to modelize the margin to see what features have an impact, and we're gonna make predictions on the sale price accordingly to the time. Let's see which features are to be taken in our models.

What factors affect profit margin?

Correlation analysis

Now that we have all our features, let's try to determine which features are the most significant for a good profit margin. Let's start by computing the correlation between our features, which is a measure of association between two features, wheter they're categorical or numerical. The features that have a too high or low correlation will be removed.

As you can see, we have categorical and numerical features in our dataset. This means we have to use specific methods in order to calculate the correlation between 2 features. The first method we could use is the cramer's V, which is based on a nominal variation of Pearson’s Chi-Square Test. Unfortunately, this method implies to loose part of the information since it's a symetrical method. This means that, given 2 variables x and y, cramers_v(x,y) = cramers_v(y,x). But it is not always true.

We need an asymmetric measure of association between categorical/Numerical features. And this is exactly what Theil’s U is. Theil’s U, also referred to as the Uncertainty Coefficient, is based on the conditional entropy between x and y — or in human language, given the value of x, how many possible states does y have, and how often do they occur.

In [33]:
import numpy as np
import scipy.stats as ss
import seaborn as sns
from dython.nominal import conditional_entropy
from collections import Counter

def cramers_v(x, y):
    confusion_matrix = pd.crosstab(x,y)
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2/n
    r,k = confusion_matrix.shape
    phi2corr = max(0, phi2-((k-1)*(r-1))/(n-1))
    rcorr = r-((r-1)**2)/(n-1)
    kcorr = k-((k-1)**2)/(n-1)
    return np.sqrt(phi2corr/min((kcorr-1),(rcorr-1)))

def theils_u(x, y):
    s_xy = conditional_entropy(x,y)
    x_counter = Counter(x)
    total_occurrences = sum(x_counter.values())
    p_x = list(map(lambda n: n/total_occurrences, x_counter.values()))
    s_x = ss.entropy(p_x)
    if s_x == 0:
        return 1
    else:
        return (s_x - s_xy) / s_x

corr = pd.DataFrame()
corr["Features"] = df.columns

for col in df.columns:
  thelis_u_list = []
  for ft in corr["Features"]:
    thelis_u_list.append(theils_u(df[col], df[ft]))
  corr[col] = thelis_u_list

corr = corr.set_index("Features")

Here's a heatmap that will help us visualize the correlations :

In [34]:
fig = px.imshow(corr, color_continuous_scale=[(0, "white"), (1, "green")])
fig.show()

We can see here that the features that seem to have the biggest impact on the profitability seem to be the sneaker name, the brand, the retail price, the release date, and the colors. On the opposite, the shoe size, the buyer Region and the order date don't have a big influence on the profitability.

Feature Importance with random forest classifier

Once we have trained a model it is possible to apply further statistical analysis to understand the effects features have on the output of the model and determine from this which features are most useful.

To illustrate feature importances I will implement a random forest classifier and use the built-in feature importances method in scikit-learn.

In [35]:
import datetime
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler


def normalize_df(df):
  scaler = MinMaxScaler()
  scaler.fit(df)
  scaled = scaler.fit_transform(df)
  scaled_df = pd.DataFrame(scaled, columns=df.columns)
  return scaled_df

def dummy_encoding(df):
  for col in df.columns:
    if df[col].dtype==object:
      df = pd.get_dummies(df, columns=[col], prefix = col, prefix_sep = "_")
  return df

def factorize(df):
  for col in df.columns:
    if df[col].dtype==object:
      df[col] = pd.factorize(df[col])[0]
  return df

target = "Margin"
rfc_df = df.copy()
#We want to consider the shoe size as an object
rfc_df['Shoe Size'] = rfc_df['Shoe Size'].astype(str)
#rfc_df['Retail Price'] = rfc_df['Retail Price'].astype(str)
rfc_df.pop('Order Date')
rfc_df.pop('Release Date')
rfc_df.pop('Days_between_OR')
rfc_df = dummy_encoding(rfc_df)
rfc_df = normalize_df(rfc_df)

from sklearn import preprocessing
from sklearn import utils

#convert y values to categorical values
lab = preprocessing.LabelEncoder()

# Spliiting data into test and train sets
X_train, X_test, y_train, y_test = train_test_split(rfc_df.drop(target, axis=1), lab.fit_transform(rfc_df[target]), test_size=0.20, random_state=0)
# fitting the model
model = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=42, max_depth = 15)
model.fit(X_train, y_train)
Out[35]:
RandomForestClassifier(max_depth=15, n_estimators=50, n_jobs=-1,
                       random_state=42)
In [36]:
# plotting feature importances
features = rfc_df.drop(target, axis=1).columns
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(10,15))
plt.title('Feature Importances')
plt.barh(range(len(indices[-22:-2])), importances[indices[-22:-2]], color='g', align='center')
plt.yticks(range(len(indices[-22:-2])), [features[i] for i in indices[-22:-2]])
plt.xlabel('Relative Importance')
plt.show()

Without considering the numerical variables such as resell price or the value increase, this graph shows us what modalities seem to be the most profitable. Here I took the 20 first modalities that have the biggest feature importance. We can see here that the Yeezy Boost 350 V2 model seem to be very profitable for StockX. Again let me explain the margin variable. It represents the sum of all margins made for all transaction. So if the margin is small but a lot of shoes are sold, the profitability will be big for StockX. We can also see that shoe size between 9 and 12 seem to make the biggest profit. Same for basic colors such as black or white, and states such as California or New York.

Is it possible to predict the sale price of a shoe at a given time?

In order to answer that question, we will group the sales by order date. We will use that new DataFrame to compute a time series forecasting.

In [37]:
df_ts = df.groupby("Order Date").sum()["Sale Price"].reset_index()
df_ts.head(3)
Out[37]:
Order Date Sale Price
0 2017-09-01 15019.0
1 2017-09-02 9744.0
2 2017-09-03 10759.0
In [38]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_ts["Order Date"], y=df_ts["Sale Price"], mode='lines', name='Sale Price',))
fig.data[0].line.color = "green"

fig.update_layout(
    title='Sale Price Performance',
    xaxis_title='Date', yaxis_title='Price'
)